library(tidyverse)
library(readxl)

Dataset and description

KOMP Plasma Metabolomics Dataset

Mouse knockouts facilitate the study of gene functions. Often, multiple abnormal phenotypes are induced when a gene is inactivated. The International Mouse Phenotyping Consortium (IMPC) has generated thousands of mouse knockouts and cataloged their phenotype data. We have acquired metabolomics data from 220 plasma samples from 30 unique mouse gene knockouts and corresponding wild-type mice from the IMPC. To acquire comprehensive metabolomics data, we have used liquid chromatography (LC) combined with mass spectrometry (MS) for detecting polar and lipophilic compounds in an untargeted approach. We have also used targeted methods to measure bile acids, steroids and oxylipins. In addition, we have used gas chromatography GC-TOFMS for measuring primary metabolites. The metabolomics dataset reports 832 unique structurally identified metabolites from 124 chemical classes as determined by ChemRICH software

In this demo we will use only the metabolites quantified in the targeted assays. The belong to the chemical classes of bile acids, steroids and oxylipins

Here we get the data

load("../data/KOMP_data_targeted.RData")

We have now three files

Metabolite Description

The amount of info in the table is uncommon

colnames(metabolite_meta)
##  [1] "CompoundID"         "Order"              "DataBaseID"        
##  [4] "Type"               "Assay"              "AnnotationApproach"
##  [7] "CompoundName"       "IKFirstBlock"       "InChiKey"          
## [10] "RetentionTime"      "MZ"                 "SMILES"            
## [13] "KEGGID"             "PubChemID"          "QC_RSD"            
## [16] "BlankRatio"         "AcylChain"

What is interesting for us is to see how much effort has been made to precisely identify the compounds. The problem of “naming” is typical of analytical chemistry and metabolomics.

metabolite_meta$InChiKey[1]
## [1] "RUDATBOHQWOJDD-BSWAIDMHSA-N"

Some of the identifier refer to publicly available databases. This is the general trend of this scientific discipline.

KEGGID refers to kegg so it can be used to link the metabolite to its position within metabolic networks.

Sample Description

The second tibble contains the description of the sample metadata

head(sample_meta_data)
## # A tibble: 6 × 4
##   MouseID Genotype Gender Zygosity
##     <dbl> <chr>    <chr>  <chr>   
## 1  413741 Pebp1    Male   Homo-   
## 2  413742 Pebp1    Male   Homo-   
## 3  413745 Pebp1    Male   Homo-   
## 4  428689 Pebp1    Female Homo-   
## 5  428693 Pebp1    Female Homo-   
## 6  433002 Pebp1    Female Homo-

The field are auto explanatory. The MouseID is the code we need to associate the sample to the class

First of all we give a look to the experimental design, just to understand the numbers we are dealing with

table(sample_meta_data$Gender,sample_meta_data$Genotype)
##         
##          A2m Ahcy Atp5b Atp6v0d1 C8a Cdk4 Dhfr Dync1li1 G6pd2 Galc Gnpda1 Idh1
##   Female   3    3     3        3   3    3    3        3     3    3      3    3
##   Male     3    3     3        3   3    3    3        3     3    3      3    3
##         
##          Iqgap1 Lmbrd1 Mfap4 Mmachc Mvk Nek2 Npc2 Null Pebp1 Phyh Pipox Plk1
##   Female      3      3     3      3   3    3    3   20     3    3     3    3
##   Male        3      3     3      3   3    3    3   20     3    3     3    3
##         
##          Pmm2 Ptpn12 Pttg1 Rock1 Sra1 Ulk3 Ywhaz
##   Female    3      3     3     3    3    3     3
##   Male      3      3     3     3    3    3     3

The Data Matrix

head(DM[,1:20])
## # A tibble: 6 × 20
##   CompoundID KOMP_413741 KOMP_…¹ KOMP_…² KOMP_…³ KOMP_…⁴ KOMP_…⁵ KOMP_…⁶ KOMP_…⁷
##   <chr>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1 CPD000001        124.    50.6     67.0  227.    101.    357.     151.    38.1 
## 2 CPD000002         63.7   88.3     38.9   74.9    16.0   126.     238.    44.8 
## 3 CPD000003         15.1   64.0     21.8   84.7    69.4   186.      19.2  183.  
## 4 CPD000004        116.   549.      71.6  195.    526.    146.     295.   179.  
## 5 CPD000005         NA     NA       NA      0.55    0.87   NA       NA     NA   
## 6 CPD000006         NA      0.53     1     NA       1.49    6.93    NA      1.68
## # … with 11 more variables: KOMP_115996 <dbl>, KOMP_138110 <dbl>,
## #   KOMP_138120 <dbl>, KOMP_190701 <dbl>, KOMP_143866 <dbl>, KOMP_143867 <dbl>,
## #   KOMP_143869 <dbl>, KOMP_144064 <dbl>, KOMP_123951 <dbl>, KOMP_123953 <dbl>,
## #   KOMP_487464 <dbl>, and abbreviated variable names ¹​KOMP_413742,
## #   ²​KOMP_413745, ³​KOMP_428689, ⁴​KOMP_428693, ⁵​KOMP_433002, ⁶​KOMP_185262,
## #   ⁷​KOMP_115900

Since our mouse ID is a number without the leading “KOMP_” we fix this on the fly

DM <- DM %>% 
  as.data.frame() %>% 
  column_to_rownames("CompoundID") %>% 
  t(.) %>% 
  as_tibble(rownames = "Sampname") %>% 
  separate(Sampname, c("One","MouseID"),"_") %>% ## split the column
  dplyr::select(-One) %>%   ## remove the first piece
  mutate(MouseID = as.numeric(MouseID))

Now the matrix looks like this

head(DM[,1:20])
## # A tibble: 6 × 20
##   MouseID CPD000001 CPD000002 CPD000003 CPD000…¹ CPD00…² CPD00…³ CPD00…⁴ CPD00…⁵
##     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
## 1  413741     124.       63.7      15.1    116.    NA      NA      140.    15.4 
## 2  413742      50.6      88.3      64.0    549.    NA       0.53   325.    21.1 
## 3  413745      67.0      38.9      21.8     71.6   NA       1       59.0    5.15
## 4  428689     227.       74.9      84.7    195.     0.55   NA      110     13.1 
## 5  428693     101.       16.0      69.4    526.     0.87    1.49   318.    42.4 
## 6  433002     357.      126.      186.     146.    NA       6.93   115.    13.1 
## # … with 11 more variables: CPD000009 <dbl>, CPD000010 <dbl>, CPD000011 <dbl>,
## #   CPD000012 <dbl>, CPD000013 <dbl>, CPD000014 <dbl>, CPD000015 <dbl>,
## #   CPD000016 <dbl>, CPD000792 <dbl>, CPD000793 <dbl>, CPD000794 <dbl>, and
## #   abbreviated variable names ¹​CPD000004, ²​CPD000005, ³​CPD000006, ⁴​CPD000007,
## #   ⁵​CPD000008

To check that we organize the matrix in a different way also joining the concentration data with the sample metadata

DM_nest <- DM %>% 
  left_join(sample_meta_data) %>% 
  pivot_longer(starts_with("CPD"), names_to = "CompoundID", values_to = "I") %>% 
  group_by(CompoundID) %>% 
  nest()

This data structure highlights the role of each metabolite

head(DM_nest)
## # A tibble: 6 × 2
## # Groups:   CompoundID [6]
##   CompoundID data              
##   <chr>      <list>            
## 1 CPD000001  <tibble [220 × 5]>
## 2 CPD000002  <tibble [220 × 5]>
## 3 CPD000003  <tibble [220 × 5]>
## 4 CPD000004  <tibble [220 × 5]>
## 5 CPD000005  <tibble [220 × 5]>
## 6 CPD000006  <tibble [220 × 5]>

the intensities of each metabolite are stored in the “data” column

head(DM_nest$data[[1]])
## # A tibble: 6 × 5
##   MouseID Genotype Gender Zygosity     I
##     <dbl> <chr>    <chr>  <chr>    <dbl>
## 1  413741 Pebp1    Male   Homo-    124. 
## 2  413742 Pebp1    Male   Homo-     50.6
## 3  413745 Pebp1    Male   Homo-     67.0
## 4  428689 Pebp1    Female Homo-    227. 
## 5  428693 Pebp1    Female Homo-    101. 
## 6  433002 Pebp1    Female Homo-    357.

The new tibble can be efficiently used to inspect the data in a univariate perspective …

Missing Values

For example, the number of NAs in relation of the median intensity can be visualized as

DM_nest %>% 
  mutate(nnas = map_dbl(data,function(x) sum(is.na(x$I))),                  ## calculate the univariate statistics "on the fly"
         medI = map_dbl(data,~median(.x$I, na.rm = TRUE))) %>% 
  filter(nnas > 0) %>% ## we keep only features with at least one NA!
  ggplot() +   ## plot them
  geom_point(aes(x = medI, y = nnas)) + 
  scale_x_log10() + 
  theme_light()

Which actually shows what we expected. High number of missing values are visible only with low median intensities … note the log scale in x.

This is telling us that there is a distinction of metabolites in two groups showing a marked difference in concentration

Is this effect dependent on the class of metabolite?

DM_nest %>% 
  left_join(metabolite_meta) %>% 
  mutate(nnas = map_dbl(data,~sum(is.na(.x$I))),                  ## calculate the univariate statistics "on the fly"
         medI = map_dbl(data,~median(.x$I, na.rm = TRUE))) %>% 
  filter(nnas > 0) %>% ## we keep only features with at leas on NA!
  ggplot(aes(x = medI, y = nnas, col = Assay)) +   ## plot them
  geom_point() + 
  scale_x_log10() + 
  scale_color_brewer(palette = "Set2") + 
  theme_light()

That’s strange … the largest number of NAs is corresponding to lower intensities for the Bile Acids_Steroids and not for Oxylipins. This is curious

It is worth highlighting the name of the metabolites showing a number of NAs larger than 50 …

DM_nest %>% 
  left_join(metabolite_meta) %>% 
  mutate(nnas = map_dbl(data,~sum(is.na(.x$I))),                  ## calculate the univariate statistics "on the fly"
         medI = map_dbl(data,~median(.x$I, na.rm = TRUE))) %>% 
  filter(nnas > 0) %>% ## we keep only features with at leas on NA!
  ggplot(aes(x = medI, y = nnas, col = Assay)) +   ## plot them
  geom_point(size = 2) + 
  geom_text(aes(label=ifelse(nnas>50,CompoundName,'')), size = 4, nudge_y = -3) + 
  scale_x_log10() + 
  scale_color_brewer(palette = "Set2") + 
  theme_light()

As we discussed in the lecture, it is difficult to decide which metabolites have too many missing values …

When we have a “limited” number of variables the best approach is to plot the data. With the DM_nest we can do that quite efficiently

To do that I create a custom plotting function which takes as input the tibble of the data and a potential title

myplot <- function(t,l = ""){
  p <- t %>% 
  ggplot() + 
  geom_jitter(aes(x = Genotype, y = I, color = Gender, shape = Zygosity), width = 0.1) + 
  scale_y_log10() + 
  scale_color_brewer(palette = "Set1") + 
  theme_light() + 
  ggtitle(l) + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
  
  p
}


myplot(DM_nest$data[[1]], "test")

And I recursively apply it to the nested data tibbles …

DM_nest <- DM_nest %>%
  left_join(metabolite_meta %>% dplyr::select(CompoundID,CompoundName)) %>% 
  mutate(rawplots = map2(data,CompoundName, ~myplot(.x,.y)))

Now I can visualize (or save) the plot I want …

This one, in particular show the trend of one of the features with a large fraction of NAs

DM_nest$rawplots[[5]]

As it can be expected, progesterone is lower in males, and, most likely close to the detection limit. It is worth remarking that progesterone was the steroid showing the lower intensity. Its behavior is then not odd.

Let’s now look to PGD2

DM_nest$rawplots[[22]]

The trend of this metabolite is really wired. Its large fraction of missing values do not seems to be associated to the intensity of the signal

As far as I can see (but you could check better … ;-) ), the distribution of the missing data does not seems to be higly associated with the study factors.

Considering that we have 220 samples the level of missingness is never extremely severe, so when it will be needed imputation should not be something that strongly influence the

Something For you

  1. play around trying to visualize the other metabolites showing high levels of missingness

Variable Distribution

The need of displaying the intensity in log scale is already suggesting that the distribution of the intensities is far from being normal. This is not unexpected in metabolomics. In real life what i do is to work almost invariably with log-transformed datasets.

For sure

  1. The low number of samples for each genotypes makes a class based assessment completely hopeless
  2. What we could do is to look to the different variables highlighting the gender and the zygosity.

We can follow the spirit of the previous plot by defining a function and using it to produce a full list of plots …

An alternative could be to unnest the data and rely on faceting …

DM_nest %>% 
  ungroup() %>% 
  dplyr::select(CompoundName,data) %>% 
  unnest(data) %>% 
  ggplot(aes(sample = I)) +
  stat_qq(aes(col = Gender, shape = Zygosity)) + 
  scale_color_brewer(palette = "Set1") + 
  stat_qq_line() + 
  facet_wrap(~CompoundName, scales = "free_y", ncol = 5) +
  theme_light() + 
  theme(aspect.ratio = 1, 
        legend.position = "bottom")

Ok, the plot is “ugly” (maybe creating a set of plots as we did before could have been more pleasant), but is really informative.

DM_nest$rawplots[[50]]

The interpretation for that would, most likely require a chat with a good biochemist…

Let’s look to the global effect of a log transformation on the q-q plots

DM_nest %>% 
  ungroup() %>% 
  dplyr::select(CompoundName,data) %>% 
  unnest(data) %>% 
  ggplot(aes(sample = log10(I))) +
  stat_qq(aes(col = Gender, shape = Zygosity)) + 
  scale_color_brewer(palette = "Set1") + 
  stat_qq_line() + 
  facet_wrap(~CompoundName, scales = "free_y", ncol = 5) +
  theme_light() + 
  theme(aspect.ratio = 1, 
        legend.position = "bottom")

The situation largely improves, even if some compounds are still showing strange behaviours …

Something For you

  1. Could you look to the metabolites showing outlying behaviours? Is this a matter of genotype?

Multivariate Visualization - PCA

The previous univariate journey should have been already pointing out some interesting aspects of the dataset.

The subsequent step is to look to visualize the multivariate structure of the data to spot larger scale patterns.

In order to do that it is necessary to impute the data matrix, substituting NAs with meaningful numbers.

  1. The number should have a variability
  2. The number should not know anything about the experimental design
  3. The number should have a reasonable analytical meaning

My choice here is to work variable wise and replace the NAs with a random number drawn from an uniform distribution spanning from 0 to the minimum value measured for that variable

The rationale behind this choice is that

In my practical example imputation could be perfoed both in the DM and in its nested version DM_nest. Since we are going to do PCA, I’ll work on the unnested data matrix.

First of all I need a function to perform the imputation of a vector

myimputer <- function(v){
  if (sum(is.na(v)) == 0) {  ## If I do not have NAs please leave the vector unchanged
    return(v)
  } else {
    napos <- which(is.na(v))  ## position of the NAs in the vector
    newval <- runif(length(napos),0,min(v, na.rm = TRUE))  ## calculate the random numbers I should put in place of the NAs
    out <- v
    out[napos] <- newval
    return(out)
  }
  
}

Now we apply it to the full set of columns

DM_i <- DM %>% 
  mutate(across(starts_with("CPD"), ~myimputer(.x)))

As we have seen, log transformation is helpful …

DM_i <- DM_i %>% 
  mutate(across(starts_with("CPD"), ~log10(.x)))

And now PCA!

library(FactoMineR)
library(factoextra)
myPCA <- PCA(DM_i %>% dplyr::select(starts_with("CPD")), graph = FALSE)
fviz_pca_ind(myPCA, 
             habillage = factor(sample_meta_data$Gender), 
             geom = "point", 
             pointsize = 2,
             axes = c(1,2)) + 
  scale_color_brewer(palette = "Set1")

Even if the amount of variability captured in this representation is only the 30%, the previous plot shows the presence of a separation between males and females.

This is not unexpected considering the results of our univariate investigation

What about the variables?

fviz_pca_biplot(myPCA, 
             habillage = factor(sample_meta_data$Gender), 
             geom = "point", 
             pointsize = 2,
             axes = c(1,2)) + 
  scale_color_brewer(palette = "Set1")

The biplot here shows that females show an higher level of the large majority of the metabolites.

Since the partial separation is visible along PC1 we could visualize the variables which show the higher contribution to that direction

fviz_contrib(myPCA, "var", axes = 1)

These plots are really informative, let’s give a look inside the PCA object to see how these number could be get out.

This list contains the information about the samples (the scores) so it can be used to reconstruct the score plot. What one needs are the coordinates

str(myPCA$ind)
## List of 4
##  $ coord  : num [1:220, 1:5] 1.25 -2.52 -3.23 1.56 3.72 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:220] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ cos2   : num [1:220, 1:5] 0.0551 0.1763 0.1952 0.0628 0.1714 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:220] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ contrib: num [1:220, 1:5] 0.06 0.2419 0.3995 0.0924 0.5275 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:220] "1" "2" "3" "4" ...
##   .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ dist   : Named num [1:220] 5.34 5.99 7.32 6.21 8.98 ...
##   ..- attr(*, "names")= chr [1:220] "1" "2" "3" "4" ...

The dist and cos2 can be used to assess how far a sample is from the PCA plane (or projection). This is important to identify potential outliers.

The list

str(myPCA$var)
## List of 4
##  $ coord  : num [1:57, 1:5] 0.501 0.438 0.224 0.614 0.241 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
##   .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ cor    : num [1:57, 1:5] 0.501 0.438 0.224 0.614 0.241 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
##   .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ cos2   : num [1:57, 1:5] 0.2515 0.1916 0.05 0.3773 0.0583 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
##   .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ contrib: num [1:57, 1:5] 2.11 1.61 0.42 3.17 0.49 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
##   .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...

instead contains the info about the variables, their loadings and the correlation with the principal components.

To extract the previous information in a more programmatic way, FactoMineR has a specific function

head(facto_summarize(myPCA, "var", axes = 1))
##                name      Dim.1        coord         cos2     contrib
## CPD000001 CPD000001 0.50146696 0.2514691117 0.2514691117 2.113225335
## CPD000002 CPD000002 0.43771074 0.1915906960 0.1915906960 1.610035960
## CPD000003 CPD000003 0.22358750 0.0499913695 0.0499913695 0.420103399
## CPD000004 CPD000004 0.61422802 0.3772760659 0.3772760659 3.170446403
## CPD000005 CPD000005 0.24146089 0.0583033611 0.0583033611 0.489953374
## CPD000006 CPD000006 0.02785031 0.0007756396 0.0007756396 0.006518101

which returns as a table the content of the PCA object.

By using the content of the myPCA object it is possible to “reconstruct” the PCA plots with ggplot and this can be handy if specific representation of the data are needed.

Suppose, for example, that you would like to construct a more informative variable importance plot which shows the correct metabolite name and also color the bars according to the chemical/metabolic class of the compound.

facto_summarize(myPCA, "var", axes = 1) %>% 
  as_tibble() %>% 
  dplyr::select(name, `Dim.1`) %>% 
  left_join(metabolite_meta, by = c("name" = "CompoundID")) %>% 
  arrange(`Dim.1`) %>% 
  mutate(CompoundName = factor(CompoundName, levels = CompoundName)) %>% 
  ggplot() + 
  geom_point(aes(x = CompoundName,  y= `Dim.1`, col = Assay), size = 2) + 
  geom_segment(aes(x = CompoundName,  yend = `Dim.1`, col = Assay, xend = CompoundName, y = 0)) + 
  scale_color_brewer(palette = "Set2") + 
  coord_flip() + 
  theme_light()

Which can be of help for the interpretation of the results

Something For you

  1. Could you try a different imputation strategy? Is this affecting the results of your exploratory data analysis?
  2. Just play and ask!

Biomarker Discovery

What we did so far is called exploratory data analysis. To write the paper we should also find the variables which are showing a significant effect of the design factors.

Notes

  1. One could look for biomarkers with univariate and multivariate methods.
  2. In a univariate perspective, to avoid the risks of non matching the assumption of many statistical tests it would be safer to rely on non-parametric approaches
  3. As far as the difference in genotype is concerned, the number of samples is limited. Most likely, there we should use parametric tests
  4. Since we are performing multiple tests, we should remember that false positives will be present!

For illustrative purposes let’s focus on the Female-Male Comparison.

## here i nest the imputed data matrix, most likely I could do the testing also in the "non" imputed one
## we also add the gender of the animal
DM_i_nest <- DM_i %>% 
  pivot_longer(starts_with("CPD"), names_to = "CompoundID", values_to = "I") %>% 
  left_join(sample_meta_data) %>% 
  nest(-c(CompoundID))

Now we run the tests on the metabolites

DM_i_nest <- DM_i_nest %>% 
  mutate(wilcoxon = map(data, ~wilcox.test(I ~ Gender, data = .x, exact = FALSE)))

If we want to see one of the tests …

DM_i_nest$wilcoxon[[10]]
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  I by Gender
## W = 7374.5, p-value = 0.005036
## alternative hypothesis: true location shift is not equal to 0

Now we extract the p-values

DM_i_nest <- DM_i_nest %>% 
  mutate(ps = map_dbl(wilcoxon, ~.x$p.value))

As we have seen yesterday, a first idea on the presence of “relevant” differences in the two classes is the distribution of the p-values

hist(DM_i_nest$ps)

Create a set of data with permuted intensities

DM_i_nest <- DM_i_nest %>% 
  mutate(perm_data = map(data, function(t)  t %>% mutate(I = sample(I)))) %>% 
  mutate(perm_wilcoxon = map(perm_data, ~wilcox.test(I ~ Gender, data = .x, exact = FALSE))) %>% 
  mutate(perm_ps = map_dbl(perm_wilcoxon, ~.x$p.value))
hist(DM_i_nest$perm_ps)

This is clearly not uniform and it speaks of the presence of a significant fraction of differences between males and females.

Now we would like to rank the variables on the base of their contrast on the two classes … but as we know it is not correct to use the p-value for that.

We need to calculate the effect size! Now I’ll use the Cohen’s d. A word of caution here, this type of effect size is parametric … but I’ve been stressing that non parametric analysis here should be preferred. Right. A non parametric measure of the effect size should be better. A possibility could be to calculate the ratio between the difference between the two medians and divide it by the larger interquartile range. You could be creative there …

Let’s stick to Cohen’s d. In the same time I’ll also calculate the p-values corrected for multiplicity

library(effsize)

DM_i_nest <- DM_i_nest %>%  
  mutate(pcorr = p.adjust(ps, method = "bonferroni")) %>% 
  mutate(cohend = map_dbl(data, ~cohen.d(I ~ Gender, data = .x)$estimate))

Now we use the effect size and the p value to create a volcano plot

library(plotly)
ggplotly(DM_i_nest %>% 
           left_join(metabolite_meta %>% select(CompoundID,CompoundName)) %>% 
           ggplot(aes(text=CompoundName)) + 
           geom_hline(yintercept = 5, lty = 2, col = "red") + 
           geom_point(aes(x = cohend, y = -log10(ps)), fill = "orange", size = 3, pch = 21) + 
           theme_bw(), tooltip="text")

Now we focus on the compound which are significant at the 0.01 level and we sort the results by effect size …

markers <- DM_i_nest %>% 
  filter(pcorr < 0.01) %>% 
  arrange(desc(abs(cohend))) %>% 
  left_join(metabolite_meta)

head(markers)
## # A tibble: 6 × 25
##   CompoundID data     wilcoxon       ps perm_d…¹ perm_…² perm_ps    pcorr cohend
##   <chr>      <list>   <list>      <dbl> <list>   <list>    <dbl>    <dbl>  <dbl>
## 1 CPD000829  <tibble> <htest>  8.22e-30 <tibble> <htest>   0.353 4.69e-28  1.66 
## 2 CPD000830  <tibble> <htest>  5.21e-21 <tibble> <htest>   0.733 2.97e-19  1.38 
## 3 CPD000807  <tibble> <htest>  1.84e-19 <tibble> <htest>   0.265 1.05e-17  1.30 
## 4 CPD000808  <tibble> <htest>  4.90e-17 <tibble> <htest>   0.638 2.79e-15  1.18 
## 5 CPD000005  <tibble> <htest>  1.92e-14 <tibble> <htest>   0.144 1.09e-12  1.17 
## 6 CPD000806  <tibble> <htest>  8.43e-12 <tibble> <htest>   0.555 4.81e-10  0.768
## # … with 16 more variables: Order <dbl>, DataBaseID <dbl>, Type <chr>,
## #   Assay <chr>, AnnotationApproach <chr>, CompoundName <chr>,
## #   IKFirstBlock <chr>, InChiKey <chr>, RetentionTime <dbl>, MZ <dbl>,
## #   SMILES <chr>, KEGGID <chr>, PubChemID <chr>, QC_RSD <dbl>,
## #   BlankRatio <dbl>, AcylChain <chr>, and abbreviated variable names
## #   ¹​perm_data, ²​perm_wilcoxon

This is the list of “biomarkers”, let’s plot one of them …

DM_nest$rawplots[[54]]

Which exactly shows what we expect.

Notes

  1. This is the metabolite showing the larger “contrast” so our interpretation could start from there.
  2. This is also the metabolite showing up in the PCA ;-)

Variable Correlation

The association between the intensity of two metabolites in the samples is often considered an interesting evidence in the perspective of the interpretation. Metabolites are indeed linked in metabolic pathways and their concentrations are not independent

As we have seen yesterday, however, correlation can be there only by chance so care have to be taken to jump from the observation of an association to the conclusion of a causal relation.

These caveats are particularly true in the case of untargeted metabolomics assays and can be somehow taken more easily as far as targeted metabolomics is concerned.

The best tool to inspect variable correlation in R is the pairs plot. With pairs one really sees the experimental data and it is easy to spot subgroups and outlying samples.

Unfortunately, the pairs visualization is manageable only with a small number of variables (something like 15). For this reason it is a resource which can be exploited after the identification of the most relevant biomarkers

## here I get the names of the first 5
focuscp <- markers %>% 
  slice(1:5) %>% 
  pull(CompoundID)
## This is a littel of magic ...
mypalette <- c("#F24D2970","#1C366B70")
names(mypalette) <- unique(sample_meta_data$Gender)

par(pty="s")

DM_i %>% 
  dplyr::select(all_of(focuscp)) %>% 
  as.matrix(.) %>% 
  pairs(., col = mypalette[sample_meta_data$Gender], pch = 19)

Notes

  1. We see correlations between the compounds. The biochemist will be happy. In particular we see an indication of a change in correlation between different oxylipins. This should be checked against metabolic pathways.
  2. It would be interesting to use the previous approach to see if some of the outlying samples are associated to a specific genotype

Something For you

  1. Could you try to use ANOVA to look to metabolites potentially different in the genotypes?
  2. Do you think that the ANOVA assumption here are fulfilled?
  3. Run a similar analysis on the rubus targeted data